Apache Arrow DataFusion 16.0.0 Project Update

您所在的位置：网站首页 › open files limit › Apache Arrow DataFusion 16.0.0 Project Update

Apache Arrow DataFusion 16.0.0 Project Update

#Apache Arrow DataFusion 16.0.0 Project Update | 来源: 网络整理| 查看: 265

Published 19 Jan 2023 By The Apache Arrow PMC (pmc)

Introduction

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format. It is targeted primarily at developers creating data intensive analytics, and offers mature SQL support, a DataFrame API, and many extension points.

Systems based on DataFusion perform very well in benchmarks, especially considering they operate directly on parquet files rather than first loading into a specialized format. Some recent highlights include clickbench and the Cloudfuse.io standalone query engines page.

DataFusion is also part of a longer term trend, articulated clearly by Andy Pavlo in his 2022 Databases Retrospective. Database frameworks are proliferating and it is likely that all OLAP DBMSs and other data heavy applications, such as machine learning, will require a vectorized, highly performant query engine in the next 5 years to remain relevant. The only practical way to make such technology so widely available without many millions of dollars of investment is though open source engine such as DataFusion or Velox.

The rest of this post describes the improvements made to DataFusion over the last three months and some hints of where we are heading.

Community Growth

We again saw significant growth in the DataFusion community since our last update. There are some interesting metrics on OSSRank.

The DataFusion 16.0.0 release consists of 543 PRs from 73 distinct contributors, not including all the work that goes into dependencies such as arrow, parquet, and object_store, that much of the same community helps support. Thank you all for your help

Several new systems based on DataFusion were recently added:

Greptime DB Synnada PRQL Parseable SeaFowl Performance 🚀

Performance and efficiency are core values for DataFusion. While there is still a gap between DataFusion and the best of breed, tightly integrated systems such as DuckDB and Polars, DataFusion is closing the gap quickly. Performance highlights from the last three months:

Up to 30% Faster Sorting and Merging using the new Row Format Advanced predicate pushdown, directly on parquet, directly from object storage, enabling sub millisecond filtering. 70% faster IN expressions evaluation (#4057) Sort and partition aware optimizations (#3969 and #4691) Filter selectivity analysis (#3868) Runtime Resource Limits

Previously, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping or Joins.

In version 16.0.0, it is possible to limit DataFusion’s memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to optionally spill to secondary storage. See #3941 for more detail.

SQL Window Functions

SQL Window Functions are useful for a variety of analysis and DataFusion’s implementation support expanded significantly:

Custom window frames such as ... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2 FOLLOWING) Unbounded window frames such as ... OVER (ORDER BY ... RANGE UNBOUNDED ROWS PRECEDING) Support for the NTILE window function (#4676) Support for GROUPS mode (#4155) Improved Joins

Joins are often the most complicated operations to handle well in analytics systems and DataFusion 16.0.0 offers significant improvements such as

Cost based optimizer (CBO) automatically reorders join evaluations, selects algorithms (Merge / Hash), and pick build side based on available statistics and join type (INNER, LEFT, etc) (#4219) Fast non column=column equijoins such as JOIN ON a.x + 5 = b.y Better performance on non-equijoins (#4562) Streaming Execution

One emerging use case for Datafusion is as a foundation for streaming-first data platforms. An important prerequisite is support for incremental execution for queries that can be computed incrementally.

With this release, DataFusion now supports the following streaming features:

Data ingestion from infinite files such as FIFOs (#4694), Detection of pipeline-breaking queries in streaming use cases (#4694), Automatic input swapping for joins so probe side is a data stream (#4694), Intelligent elision of pipeline-breaking sort operations whenever possible (#4691), Incremental execution for more types of queries; e.g. queries involving finite window frames (#4777).

These are a major steps forward, and we plan even more improvements over the next few releases.

Better Support for Distributed Catalogs

16.0.0 has been enhanced support for asynchronous catalogs (#4607) to better support distributed metadata stores such as Delta.io and Apache Iceberg which require asynchronous I/O during planning to access remote catalogs. Previously, DataFusion required synchronous access to all relevant catalog information.

Additional SQL Support

SQL support continues to improve, including some of these highlights:

Add TPC-DS query planning regression tests #4719 Support for PREPARE statement #4490 Automatic coercions ast between Date and Timestamp #4726 Support type coercion for timestamp and utf8 #4312 Full support for time32 and time64 literal values (ScalarValue) #4156 New functions, incuding uuid() #4041, current_time #4054, current_date #4022 Compressed CSV/JSON support #3642

The community has also invested in new sqllogic based tests to keep improving DataFusion’s quality with less effort.

Plan Serialization and Substrait

DataFusion now supports serialization of physical plans, with a custom protocol buffers format. In addition, we are adding initial support for Substrait, a Cross-Language Serialization for Relational Algebra

How to Get Involved

Kudos to everyone in the community who contributed ideas, discussions, bug reports, documentation and code. It is exciting to be building something so cool together!

If you are interested in contributing to DataFusion, we would love to have you join us. You can try out DataFusion on some of your own data and projects and let us know how it goes or contribute a PR with documentation, tests or code. A list of open issues suitable for beginners is here.

Check out our Communication Doc on more ways to engage with the community.

Appendix: Contributor Shoutout

Here is a list of people who have contributed PRs to this project over the last three releases, derived from git shortlog -sn 13.0.0..16.0.0 . Thank you all!

113 Andrew Lamb 58 jakevin 46 Raphael Taylor-Davies 30 Andy Grove 19 Batuhan Taskaya 19 Remzi Yang 17 ygf11 16 Burak 16 Jeffrey 16 Marco Neumann 14 Kun Liu 12 Yang Jiang 10 mingmwang 9 Daniël Heres 9 Mustafa akur 9 comphead 9 mvanschellebeeck 9 xudong.w 7 dependabot[bot] 7 yahoNanJing 6 Brent Gardner 5 AssHero 4 Jiayu Liu 4 Wei-Ting Kuo 4 askoa 3 André Calado Coroado 3 Jie Han 3 Jon Mease 3 Metehan Yıldırım 3 Nga Tran 3 Ruihang Xia 3 baishen 2 Berkay Şahin 2 Dan Harris 2 Dongyan Zhou 2 Eduard Karacharov 2 Kikkon 2 Liang-Chi Hsieh 2 Marko Milenković 2 Martin Grigorov 2 Roman Nozdrin 2 Tim Van Wassenhove 2 r.4ntix 2 unconsolable 2 unvalley 1 Ajaya Agrawal 1 Alexander Spies 1 ArkashaJavelin 1 Artjoms Iskovs 1 BoredPerson 1 Christian Salvati 1 Creampanda 1 Data Psycho 1 Francis Du 1 Francis Le Roy 1 LFC 1 Marko Grujic 1 Matt Willian 1 Matthijs Brobbel 1 Max Burke 1 Mehmet Ozan Kabak 1 Rito Takeuchi 1 Roman Zeyde 1 Vrishabh 1 Zhang Li 1 ZuoTiJia 1 byteink 1 cfraz89 1 nbr 1 xxchan 1 yujie.zhang 1 zembunia 1 哇呜哇呜呀咦耶

【本文地址】

Apache Arrow DataFusion 16.0.0 Project Update

Apache Arrow DataFusion 16.0.0 Project Update

今日新闻

推荐新闻